Introduction to Distributional Regression

Lecture 4

Gillian Heller

NHMRC Clinical Trials Centre, University of Sydney

gillian.heller@sydney.edu.au

Fitting a distributional regression model

We need to

  1. Select the response distribution

  2. Select covariates for all of the model parameters of the chosen distribution

  3. Check model fit

  4. Go back to 1.

3. Check model fit

Residuals

  • Residuals of the normal linear model are based on the differences

y_i-\hat y_i

  • Suitably standardised, these are approx normally distributed if model assumptions are correct

  • This doesn’t extend to non-normal response distributions

  • We use quantile residuals (or probability integral transform (PIT) residuals)

Quantile residuals

  • For continuous random variable y with true cdf F(\cdot)

F(y_i)\sim\mathcal{U}(0,1) (This is a well-known result.)

  • We fit a model and obtain parameter estimates (\hat\theta_{i1},\ldots,\hat\theta_{iK})

  • Compute

\hat u_i=F\left(y_i|\hat\theta_{i1},\ldots,\hat\theta_{iK}\right)

  • If the model is correct then the \hat u_i should be \,\mathcal{U}(0,1) distributed
  • Because we are used to testing residuals for normality, we apply the normalisation:

\hat r_i=\Phi^{-1}(\hat u_i)

  • The \hat r_i are the normalized quantile residuals, which we can

    • test for normality

    • use for identification of outlying observations

Discrete response distribution: randomized quantile residuals

  • For discrete distributions, the cdf F(\cdot) is a step function and the \hat u_i’s are discrete
  • We apply a randomisation (jittering) to the \hat u_i’s (before normalisation) so that they are uniformly distributed
  • This means that randomized quantile residuals will be (slightly) different each time they are computed

    • This can be fixed by setting the randomisation seed (set.seed())

Worm plot

  • The worm plot is a detrended version of the QQ plot

  • Lack of fit in the tails is highlighted better

  • Can be split by covariate for identifying where lack of fit is occurring

Dutch boys BMI

Recall the full data set

We fit the following model

m.BCTo <- gamlss(bmi ~ pb(age),
                 sigma.formula =~ pb(age),
                 nu.formula =~ pb(age),
                 tau.formula =~ pb(age),
                 data=dbbmi, family=BCTo, trace = FALSE)

pb() = P-spline

Dutch boys BMI

wp(m.BCTo, ylim.all=0.5)

The worm plot indicates a mostly well-fitting model.

Dutch boys BMI

Worm plot by xvar

The following divides age into 6 levels with approx equal numbers of observations:

wp(m.BCTo, xvar=age, n.inter=6)
number of missing points from plot= 1  out of  1222 
number of missing points from plot= 3  out of  1209 
number of missing points from plot= 1  out of  1217 
number of missing points from plot= 4  out of  1215 
number of missing points from plot= 0  out of  1215 
number of missing points from plot= 1  out of  1216 

Dutch boys BMI

Worm plot by xvar

The following specifies cutpoints for age:

wp(m.BCTo, xvar=age, xcut.points=c( 5, 10, 15))
number of missing points from plot= 3  out of  2689 
number of missing points from plot= 2  out of  784 
number of missing points from plot= 8  out of  1939 
number of missing points from plot= 0  out of  1882 

Comparative worm plots

Another model for BMI: Gamma (GA) distribution

m.GA <- gamlss(bmi ~ pb(age),
                 sigma.formula =~ pb(age),
                 data=dbbmi, family=GA, trace = FALSE)
model_wp(m.GA, m.BCTo)

Model comparisons using GAIC

  • Default is k=2 i.e. AIC
GAIC(m.BCTo, m.GA)
            AIC       df
m.BCTo 29225.49 35.98429
m.GA   29780.39 28.10985


  • BIC
GAIC(m.BCTo, m.GA, k = log(nrow(dbbmi)))
            AIC       df
m.BCTo 29473.59 35.98429
m.GA   29974.20 28.10985

Dutch boys BMI : Fitted values plot

For the BCTo distribution, \mu is the median.

pred <- predict(m.BCTo, what="mu", type = "response")
dbbmi <- cbind(dbbmi, pred)

ggplot(dbbmi, aes(x=age, y=bmi)) +
  geom_point(shape = 21, fill = "lightgray", color = "black", size = 0.5) +
  geom_smooth(aes(x=age, y=pred), size=0.5, colour = "blue") +
  theme_bw()